I. Introduction: Contextualizing the King County House Price Prediction Model

Problem Statement and Objective

In the vibrant and diverse King County real estate market, including Seattle’s dynamic environment, property prices are shaped by an array of variables. The primary challenge is to construct a predictive model that can accurately estimate house prices within this area. Utilizing a comprehensive dataset that encompasses diverse house attributes, this model aims to decode the complex mechanisms influencing house pricing.

Purpose of the Model

Predictive Accuracy

The model strives to offer precise price predictions for properties in King County by effectively correlating various house features with their market prices. This aspect is crucial in understanding and quantifying how different characteristics impact the value of a property.

Analytical Insight

A key goal of the model is to unearth and interpret the multitude of factors that play a significant role in determining house prices within the region. This venture goes beyond mere statistical analysis to provide practical, real-world insights, thereby enriching the understanding of real estate dynamics for all stakeholders.

Decision Support

The model is designed to be a powerful asset for a range of users, including real estate agents, prospective buyers, and sellers. By offering accurate price predictions and deep market insights, it aids in making informed and strategic decisions in the property market.

Scope and Methodology

Data Preprocessing and Exploration

Initial data preparation is vital to ensure accuracy in the model. This stage involves cleansing the data, converting data types, and creating dummy variables for categorical features. Following this, an exploratory data analysis (EDA) is conducted to delve into the dataset’s characteristics, examining statistical summaries and relationships between variables.

Feature Selection and Model Assumptions

The process involves using statistical techniques like stepwise regression for feature selection and conducting tests like the Variable Inflation Factor (VIF) and Anderson-Darling to check for multicollinearity and normality, respectively. Additionally, diagnostic plots are used for detecting outliers.

Model Development and Validation

A range of models are employed and assessed:

Linear Models: Including Ordinary Least Squares (OLS) and Weighted Least Squares (WLS).

Regularization Techniques: Such as Ridge, Lasso, and Elastic Net to handle multicollinearity.

Robust Regression: Utilizing Huber’s method to minimize the influence of outliers.

Advanced Models: Exploring alternatives like regression trees, neural networks (NN), or support vector machines (SVM).

Model Performance Evaluation

The model’s effectiveness is evaluated using metrics like RMSE and R-squared, across both the training (70%) and testing (30%) data sets, to ensure its reliability and applicability in real-world scenarios.

Conclusion

This introduction sets the stage for a comprehensive analysis, highlighting the multifaceted approach adopted in this project. From meticulous data preparation to sophisticated modeling, the endeavor is not just to predict house prices accurately but also to provide valuable insights into King County’s real estate market.

II. Description of the Data and Quality

Dataset Overview and Detailed Description

The King County house sales dataset is a comprehensive collection of 21,613 observations, each representing a unique house sale. The dataset encompasses a variety of features that describe different aspects of the houses sold. Below is a detailed description of each variable in the dataset:

Variable Description
id Unique ID for each home sold (not used as a predictor)
date Date of the home sale
price Price of each home sold
bedrooms Number of bedrooms
bathrooms Number of bathrooms, “.5” accounts for a bathroom with a toilet but no shower
sqft_living Square footage of the apartment interior living space
sqft_lot Square footage of the land space
floors Number of floors
waterfront A dummy variable for whether the apartment was overlooking the waterfront or not
view An index from 0 to 4 of how good the view of the property was
condition An index from 1 to 5 on the condition of the apartment
grade An index from 1 to 13 about building construction and design quality
sqft_above The square footage of the interior housing space above ground level
sqft_basement The square footage of the interior housing space below ground level
yr_built The year the house was initially built
yr_renovated The year of the house’s last renovation
zipcode The zipcode area the house is in
lat Latitude coordinate
long Longitude coordinate
sqft_living15 The square footage of interior housing living space for the nearest 15 neighbors
sqft_lot15 The square footage of the land lots of the nearest 15 neighbors

Data Quality and Transformation

Data Cleaning and Transformation

The dataset’s preparation involved meticulous cleaning and transformation processes to optimize it for accurate predictive analysis. Key steps undertaken include:

  1. Exclusion of Non-Predictive Variables:
    • The id variable, representing a unique identifier for each house sale, does not contribute to predicting house prices and was therefore removed. This step is crucial in focusing the model on variables that influence the outcome (price).
    • Unlike other non-predictive variables, lat (latitude) and long (longitude) were initially retained for their crucial role in calculating geographical distances, which could potentially influence house prices.
  2. Transformation of Data Types:
    • The date variable, initially in a string format, was transformed into a numeric format. This conversion is essential for incorporating the date into statistical models, as numeric representations are more amenable to various types of analysis.
    • For variables like price, sqft_living, sqft_lot, etc., necessary conversions were performed to ensure they are in a suitable numeric format.
  3. Creation of Dummy Variables for Categorical Data:
    • Categorical variables like waterfront, view, condition, and grade were transformed into dummy variables. This transformation is pivotal for regression analysis as it allows these non-numeric variables to be effectively included in the model.
    • The process involved converting these categorical variables into a series of binary variables (0 or 1). This is particularly important for variables like waterfront, which is a binary indicator itself, and for ordinal variables like view and condition, which have intrinsic order but need to be numerically represented for modeling.
  4. Handling Special Cases in Variables:
    • For variables like bathrooms, where values like “0.5” represent bathrooms with a toilet but no shower, the data was kept as is, considering these nuances convey important information about the house’s characteristics.
  5. Grouping and Clustering of Variables:
    • The zipcode variable was transformed by extracting the first three digits, which helps in reducing the number of dummy variables and preventing the model from becoming overly complex while still capturing the geographical influences on house prices.
    • The grade variable was clustered into broader categories to simplify the model and focus on significant differences in construction and design quality.
  6. Haversine Distance Calculation:
    • To incorporate the influence of location more precisely, the Haversine distance was calculated. This involved creating a function to calculate the distance between two geographical points (latitude and longitude) and applying this to our dataset.
    • The calculation of haversine_distance is particularly significant for understanding the spatial relationships and proximity to key locations that might affect house prices.
  7. Calculation of Convergence Point:
    • The dataset was used to identify a ‘convergence point’ – a central point derived from houses with the highest values. This point served as a reference to calculate each property’s distance from a high-value central location, possibly a marker of a desirable area.
    • This step was critical in ensuring that the model accounts for locational desirability without causing data leakage, as it was based solely on the training set.
# Data Preprocessing and Transformation
set.seed(123)  # Setting a seed for reproducibility
split_index <- sample(1:nrow(df), size = 0.7 * nrow(df))
train_df <- df[split_index, ]
test_df <- df[-split_index, ]

# Remove non-numeric characters from the 'price' column and convert it to numeric
train_df$price <- as.numeric(str_replace_all(train_df$price, "[^0-9.]", ""))
test_df$price <- as.numeric(str_replace_all(test_df$price, "[^0-9.]", ""))

# Calculation of Convergence Point: Determine the convergence point for high-value homes
high_value_threshold <- quantile(train_df$price, probs = 0.90, na.rm = TRUE)  # Calculate the high-value threshold
high_value_homes <- train_df[train_df$price >= high_value_threshold, ]  # Select high-value homes
convergence_point <- c(mean(high_value_homes$lat, na.rm = TRUE), mean(high_value_homes$long, na.rm = TRUE))  # Calculate the convergence point

# Data Transformation Function with Distance Binning Option
transform_data <- function(df, convergence_point, linear_model) {
  # Date Transformation: Convert the 'date' column to a Date object if present
  if ("date" %in% colnames(df)) {
    df$date <- as.Date(substr(as.character(df$date), 1, 8), format="%Y%m%d")
    # Date-Time Feature Engineering: Extract various date-related features
    df$year_sold <- lubridate::year(df$date)
    df$month_sold <- lubridate::month(df$date)
    df$day_sold <- lubridate::day(df$date)
    df$season <- factor(lubridate::quarter(df$date), labels = c("Winter", "Spring", "Summer", "Fall"))
    df$week_of_year <- lubridate::week(df$date)
    df$day_of_year <- lubridate::yday(df$date)
  }
  # Creating Dummy Variables: Convert categorical variables into dummy variables
  df <- df %>%
    mutate(zipcode = as.factor(zipcode),
           waterfront = as.factor(waterfront),
           view = as.factor(view),
           condition = as.factor(condition),
           grade = as.character(grade)) %>%
    dummy_cols(select_columns = c('zipcode', 'view', 'condition', 'grade', 'waterfront', 'season'))
  # Remove last dummy variables to avoid multicollinearity
  if (linear_model) {
    df <- df[, !(names(df) %in% c("zipcode_98199", "view_0", "condition_1", "grade_13", "season_Winter", "waterfront_1"))]
  }
  # Haversine Distance Function: Calculate the distance between two points on Earth's surface
  haversine_distance <- function(lat1, long1, lat2, long2) {
    R <- 6371  # Earth radius in kilometers
    delta_lat <- (lat2 - lat1) * pi / 180
    delta_long <- (long2 - long1) * pi / 180
    a <- sin(delta_lat/2)^2 + cos(lat1 * pi / 180) * cos(lat2 * pi / 180) * sin(delta_long/2)^2
    c <- 2 * atan2(sqrt(a), sqrt(1 - a))
    d <- R * c  # Calculate the haversine distance
    return(d)
  }
  # Calculate Haversine Distance
  df$distance_to_convergence <- mapply(haversine_distance, df$lat, df$long,
                                       MoreArgs = list(lat2 = convergence_point[1], long2 = convergence_point[2]))
  # Remove columns that are no longer needed
  df <- df[, !(names(df) %in% c("id", "date", "zipcode", "view", "condition", "grade", "waterfront", "season"))]
  return(df)
}
# Applying the transformation function to training and test sets
train_df_linear <- transform_data(train_df, convergence_point, linear_model = TRUE)  # Transform the training data for linear models
test_df_linear <- transform_data(test_df, convergence_point, linear_model = TRUE)    # Transform the test data for linear models
train_df_non_linear <- transform_data(train_df, convergence_point, linear_model = FALSE)  # Transform the training data
test_df_non_linear <- transform_data(test_df, convergence_point, linear_model = FALSE)    # Transform the test data

# Set this to TRUE to update all the json model_parameters that are stored the JSON
# Check if the update_model_parameters is TRUE or not
update_model_parameters <- FALSE

# This updates the json with the parameters that were obtained from the intensive process of running
update_model_json <- function(model_name, features, filepath) {
  model_params <- if (file.exists(filepath)) {
                    fromJSON(filepath)
                  } else {
                    list()
                  }
  model_params[[model_name]] <- features
  write_json(model_params, filepath)
}

# # Save the transformed data as CSV files in the current directory
# write.csv(train_df_linear, "train_df_linear.csv", row.names = FALSE)
# write.csv(test_df_linear, "test_df_linear.csv", row.names = FALSE)
# write.csv(train_df_non_linear, "train_df_non_linear.csv", row.names = FALSE)
# write.csv(test_df_non_linear, "test_df_non_linear.csv", row.names = FALSE)

Training Data Header

price bedrooms bathrooms sqft_living sqft_lot floors sqft_above sqft_basement yr_built yr_renovated lat long sqft_living15 sqft_lot15 year_sold month_sold day_sold week_of_year day_of_year zipcode_98001 zipcode_98002 zipcode_98003 zipcode_98004 zipcode_98005 zipcode_98006 zipcode_98007 zipcode_98008 zipcode_98010 zipcode_98011 zipcode_98014 zipcode_98019 zipcode_98022 zipcode_98023 zipcode_98024 zipcode_98027 zipcode_98028 zipcode_98029 zipcode_98030 zipcode_98031 zipcode_98032 zipcode_98033 zipcode_98034 zipcode_98038 zipcode_98039 zipcode_98040 zipcode_98042 zipcode_98045 zipcode_98052 zipcode_98053 zipcode_98055 zipcode_98056 zipcode_98058 zipcode_98059 zipcode_98065 zipcode_98070 zipcode_98072 zipcode_98074 zipcode_98075 zipcode_98077 zipcode_98092 zipcode_98102 zipcode_98103 zipcode_98105 zipcode_98106 zipcode_98107 zipcode_98108 zipcode_98109 zipcode_98112 zipcode_98115 zipcode_98116 zipcode_98117 zipcode_98118 zipcode_98119 zipcode_98122 zipcode_98125 zipcode_98126 zipcode_98133 zipcode_98136 zipcode_98144 zipcode_98146 zipcode_98148 zipcode_98155 zipcode_98166 zipcode_98168 zipcode_98177 zipcode_98178 zipcode_98188 zipcode_98198 view_1 view_2 view_3 view_4 condition_2 condition_3 condition_4 condition_5 grade_3 grade_4 grade_5 grade_6 grade_7 grade_8 grade_9 grade_10 grade_11 grade_12 waterfront_0 season_Spring season_Summer season_Fall distance_to_convergence
475000 4 2.50 2040 16200 2.0 2040 0 1997 0 47.7366 -121.958 2530 15389 2015 3 7 10 66 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 23.544800
316000 4 1.50 2120 46173 2.0 2120 0 1974 0 47.6503 -121.968 2000 46173 2015 5 8 19 128 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 1 1 0 0 19.029290
802000 4 2.25 2130 8734 2.0 2130 0 1961 0 47.5672 -122.161 2550 8800 2014 9 4 36 247 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 1 0 1 0 6.897144
905000 4 2.50 3330 9557 2.0 3330 0 1995 0 47.5526 -122.102 3360 9755 2015 3 25 12 84 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 11.166806
700000 4 2.25 2440 9450 1.5 2440 0 1947 2014 47.7061 -122.307 1720 7503 2014 10 30 44 303 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 1 12.037276
178500 3 1.00 900 10511 1.0 900 0 1961 0 47.2883 -122.272 1460 10643 2015 2 12 7 43 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 36.721387

Statistical Analysis and Correlation

Exploratory Data Analysis (EDA)

The exploratory data analysis (EDA) conducted on the King County house sales dataset is an in-depth exploration aimed at uncovering patterns, anomalies, and relationships within the data. This comprehensive EDA includes a variety of analyses to gain a holistic understanding of the dataset’s characteristics.

  1. Distribution Analysis of Continuous Variables:
    • This analysis focuses on continuous variables like price, sqft_living, sqft_lot, and others. Key aspects include examining their distributions, identifying potential outliers, and understanding their range and central tendencies.
    • Histograms and box plots are used to visualize these distributions, which can reveal skewness, kurtosis, and other distributional characteristics important for model assumptions.
  2. Categorical Variable Analysis:
    • The distribution and count of categorical variables such as bedrooms, bathrooms, floors, waterfront, view, condition, and grade are analyzed.
    • Bar plots and frequency tables help in understanding the prevalence of different categories and their potential impact on house prices.
  3. Correlation Analysis:
    • Understanding how continuous variables correlate with each other and, more importantly, with the target variable price.
    • A correlation matrix and corresponding heat map provide a visual and quantitative view of these relationships, highlighting variables that might have a strong positive or negative relationship with house prices.
  4. Temporal Trends Analysis:
    • Analyzing the influence of time-related features such as year_sold, month_sold, and season on house prices.
    • Time series plots and seasonal decomposition can reveal trends, seasonality, and cyclical patterns in house prices.
  5. Geographical Influence Analysis:
    • Investigating the spatial aspect by analyzing the distance_to_convergence variable.
    • Scatter plots or spatial heat maps can illustrate if proximity to the high-value convergence point influences house prices.

Continuous Variable Analysis

This analysis focuses on continuous variables like price, sqft_living, sqft_lot, and others. Key aspects include examining their distributions, identifying potential outliers, and understanding their range and central tendencies.

Price vs. Square Footage of Living Space

In the scatter plot above, we compare the price of homes against their sqft_living (square footage of interior living space). This visualization allows us to explore the relationship between these two variables.

The histogram above displays the distribution of sqft_living. It reveals that the variable is right-skewed, with most homes having smaller living spaces and relatively fewer very large living spaces.

Price vs. Square Footage of Lot

The scatter plot above compares price against sqft_lot (square footage of land space). It helps us understand if there’s any relationship between the size of the lot and the sale price.

The histogram above visualizes the distribution of sqft_lot. Similar to sqft_living, this variable is right-skewed, with most homes having smaller lot sizes and relatively fewer very large lots.

Price vs. Square Footage Above Ground

In the scatter plot above, we compare price against sqft_above (square footage of the interior housing space above ground level). This analysis helps us explore the impact of above-ground living space on home prices.

The histogram above shows the distribution of sqft_above. It suggests that most homes have similar above-ground square footage, with relatively fewer having significantly larger or smaller above-ground spaces.

Price vs. Square Footage of Basement

Excluding homes that do not have a basement.

The scatter plot above compares price against sqft_basement (square footage of the interior housing space below ground level). This visualization helps us understand if the presence and size of a basement influence home prices.

The histogram above visualizes the distribution of sqft_basement. It indicates that most homes have little to no basement space, while some have larger basement areas.

Price vs. Year Built

The scatter plot above compares price against the year when homes were initially built (yr_built). This analysis helps us understand how the age of a home relates to its sale price.

The histogram above displays the distribution of yr_built. It provides insights into the distribution of home ages in the dataset.

Price vs. Year of Last Renovation

Excluding homes that did not have a documented renovation.

In the scatter plot above, we compare price against the year of the last renovation (yr_renovated). This analysis helps us understand whether recent renovations impact home prices.

The histogram above visualizes the distribution of yr_renovated. It provides insights into the distribution of renovation years in the dataset.

Price vs. Distance to Convergence

The scatter plot above compares price against distance_to_convergence. This analysis helps us explore whether the distance to a convergence point impacts home prices.

Categorical Variable Analysis

The distribution and count of categorical variables such as bedrooms, bathrooms, floors, waterfront, view, condition, and grade are analyzed.

Price vs. Bedrooms

The scatter plot above compares price against the number of bedrooms. This visualization helps us understand how the number of bedrooms influences home prices.

The bar plot above displays the distribution of the bedrooms variable, showing the frequency of each bedroom count.

Price vs. Bathrooms

In the scatter plot above, we compare price against the number of bathrooms. This analysis helps us explore the relationship between the number of bathrooms and home prices.

The bar plot above visualizes the distribution of the bathrooms variable, showing the frequency of each bathroom count.

Price vs. Floors

The scatter plot above compares price against the number of floors. This analysis helps us understand how the number of floors in a home relates to its sale price.

The bar plot above displays the distribution of the floors variable, showing the frequency of each floor count.

Price vs. Waterfront

In the scatter plot above, we compare price against the waterfront variable. This visualization helps us explore how having a waterfront view impacts home prices.

The bar plot above visualizes the distribution of the waterfront variable, showing the frequency of waterfront and non-waterfront properties.

Price vs. View

The scatter plot above compares price against the view variable, which represents the quality of the property’s view. This analysis helps us explore how the view quality impacts home prices.

The bar plot above displays the distribution of the view variable, showing the frequency of different view quality ratings.

Price vs. Condition

In the scatter plot above, we compare price against the condition variable, which represents the condition of the property. This analysis helps us explore how property condition relates to home prices.

The bar plot above visualizes the distribution of the condition variable, showing the frequency of different condition ratings.

Price vs. Grade

The scatter plot above compares price against the grade variable, which has been aggregated into categories as per the provided header. This analysis helps us explore how the grade of construction and design impacts home prices.

The bar plot above displays the distribution of the grade_category variable, showing the frequency of different grade categories.

Correlation Analysis

Understanding how continuous variables correlate with each other and, more importantly, with the target variable price.

Top 20 Correlation Values with Price
Variable Correlation with Price
price 1.0000000
sqft_living 0.7055923
grade_category 0.6713220
grade_category_numeric 0.6713220
sqft_above 0.6090981
sqft_living15 0.5872083
bathrooms 0.5337709
view_category 0.4016872
grade_11 0.3693088
grade_10 0.3326084
sqft_basement 0.3278610
bedrooms 0.3141109
view_4 0.3107036
lat 0.3066488
grade_12 0.2927317
waterfront_1 0.2720890
zipcode_98004 0.2685431
floors 0.2593524
grade_9 0.2326972
grade_13 0.2198225

Highly Correlated Variable Pairs
Variable1 Variable2 Correlation

Detailed Explanation for Removal

  1. sqft_above & sqft_living: Both variables are highly correlated because the square footage of the living area above ground (sqft_above) is part of the total square footage of living space (sqft_living). We remove sqft_above as it is likely to contain less unique information than the total living space.

  2. week_of_year, day_of_year, month_sold: These variables are related to the date the house was sold and are thus inherently correlated. day_of_year carries the most granular information, so we might prefer to keep it and remove week_of_year and month_sold which provide more aggregated temporal information.

  3. condition_4 & condition_3: The condition of the house is a categorical variable that has been one-hot encoded. Since these are mutually exclusive categories, they are negatively correlated. We might decide to keep one category as the reference group and remove the others, or revert to the original categorical variable to capture the overall condition in a single variable.

By removing these variables, we aim to reduce multicollinearity, which can distort the estimated regression coefficients, inflate standard errors, and undermine the statistical significance of the predictors. The goal is to retain the variables that provide the unique and informative contribution to the model’s prediction of house prices.

Correlation Graphics Analysis

In the table above, we’ve displayed the top 20 correlation values with the target variable price, sorted by their absolute values. Here are some of the key findings:

  1. Positive Correlations with Price:
    • Variables such as sqft_living, sqft_above, sqft_living15, and bathrooms exhibit strong positive correlations with the target variable. This suggests that as these variables increase, the house price tends to increase as well.
    • Features like grade_11_13, view_4, and grade_8_10 also show positive correlations, indicating that higher-grade properties and better views tend to have higher prices.
  2. Negative Correlations with Price:
    • No negative correlations are present in the top 20. This means that none of the examined features strongly suggest a decrease in price as they increase.
  3. Feature Importance:
    • The strength of the correlations helps us understand the importance of these variables in predicting house prices. Features like sqft_living and grade_11_13 appear to be strong predictors of price.
    • Variables related to location, such as zipcode_98004, zipcode_98039, and zipcode_98040, also have notable positive correlations, indicating the significance of location in price determination.

Geographical Influence Analysis

Investigating the spatial aspect by analyzing the distance_to_convergence variable.

Conclusion

This detailed review of the King County house sales dataset underscores the thorough preparation undertaken for the predictive analysis. The dataset’s diverse variables, both continuous and categorical, have been meticulously processed and analyzed, providing a robust foundation for developing the predictive model. With the comprehensive EDA and graphical analysis, we gain valuable insights into the correlations and distributions within the data, setting the stage for effective model building and accurate house price prediction.

III. Model Development Process

Up to this point, we have successfully conducted an exploratory data analysis (EDA) to gain valuable insights into the dataset. We’ve visualized key features such as price, bedrooms, bathrooms, and more, allowing us to better understand the data’s distribution and relationships. Additionally, we’ve explored various trends, including monthly, seasonal, weekly, and daily trends in both house prices and the count of homes sold. Furthermore, we have cleaned and prepared the data, removing irrelevant variables like id, lat, and long to streamline it for modeling. With these preliminary steps completed, we are now ready to delve into the model development process.

Initial OLS Model

To commence the model development process, we establish an Ordinary Least Squares (OLS) regression model as our baseline. This initial model utilizes the features that have undergone transformation and cleaning during the exploratory data analysis (EDA) phase. To maintain data quality, enhance model performance, and facilitate interpretability, we begin by removing columns introduced in prior graphical iterations. Additionally, we employ the standard data preprocessing practice of dropping columns with missing values (NA) to ensure dataset integrity. This step ensures that our subsequent analyses and models are built upon a robust and complete dataset, minimizing errors and potential biases.

# Fit a linear regression model to the training data
linear_model_initial <- lm(price ~ ., data = train_df_linear)

# Evaluate OLS_linear
df_results <- evaluate_model("OLS_linear", linear_model_initial, train_df_linear, test_df_linear, target_var = 'price', NULL)

# Conditional logic based on the update_model_parameters flag
if (update_model_parameters) {
  # Run each of the stepwise regression models and update the JSON file
  # Step model both
  model_both <- ols_step_both_p(linear_model_initial, pent=0.35, prem=0.05)
  features_both <- setdiff(names(coef(model_both$model)), "(Intercept)")
  update_model_json("OLS_Step_Both", features_both, json_filepath)

  # Step model Backward
  model_backward <- ols_step_backward_p(linear_model_initial, pent=0.35, prem = 0.05)
  features_backward <- setdiff(names(coef(model_backward$model)), "(Intercept)")
  update_model_json("OLS_Step_Backward", features_backward, json_filepath)

  # Step model foward
  model_forward <- ols_step_forward_p(linear_model_initial, pent=0.35, penter = 0.05)
  features_forward <- setdiff(names(coef(model_forward$model)), "(Intercept)")
  update_model_json("OLS_Step_Forward", features_forward, json_filepath)
} else {
  # Load model parameters from JSON and build models
  model_params <- fromJSON(json_filepath)

  # Create models based on the loaded features
  if (all(c("OLS_Step_Both", "OLS_Step_Backward", "OLS_Step_Forward") %in% names(model_params))) {
    # For each model type, create the model using the features stored in the JSON
    model_both <- create_model(train_df_linear, 'price', model_params$OLS_Step_Both)
    model_backward <- create_model(train_df_linear, 'price', model_params$OLS_Step_Backward)
    model_forward <- create_model(train_df_linear, 'price', model_params$OLS_Step_Forward)
  } else {
    stop("Required model parameters are missing in the JSON file.")
  }
}

# Show inital linear model results
summary(linear_model_initial)
## 
## Call:
## lm(formula = price ~ ., data = train_df_linear)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1528434   -59681     1549    54607  3641948 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             -1.009e+08  9.861e+06 -10.235  < 2e-16 ***
## bedrooms                -9.988e+03  1.710e+03  -5.842 5.26e-09 ***
## bathrooms                2.294e+04  2.921e+03   7.855 4.27e-15 ***
## sqft_living              1.638e+02  3.455e+00  47.417  < 2e-16 ***
## sqft_lot                 2.172e-01  4.313e-02   5.035 4.84e-07 ***
## floors                  -2.598e+04  3.545e+03  -7.329 2.44e-13 ***
## sqft_basement           -4.937e+01  4.069e+00 -12.132  < 2e-16 ***
## yr_built                -2.751e+02  7.281e+01  -3.779 0.000158 ***
## yr_renovated             2.646e+01  3.287e+00   8.050 8.92e-16 ***
## sqft_living15            1.199e+01  3.225e+00   3.719 0.000201 ***
## sqft_lot15              -8.570e-02  6.711e-02  -1.277 0.201576    
## year_sold                5.186e+04  4.892e+03  10.601  < 2e-16 ***
## day_sold                -2.752e+02  1.462e+02  -1.882 0.059874 .  
## day_of_year              1.462e+02  5.477e+01   2.669 0.007617 ** 
## zipcode_98001           -2.303e+05  1.798e+04 -12.808  < 2e-16 ***
## zipcode_98002           -2.282e+05  1.981e+04 -11.516  < 2e-16 ***
## zipcode_98003           -2.281e+05  1.847e+04 -12.348  < 2e-16 ***
## zipcode_98004            2.840e+05  1.601e+04  17.734  < 2e-16 ***
## zipcode_98005           -1.572e+05  1.777e+04  -8.843  < 2e-16 ***
## zipcode_98006           -1.781e+05  1.329e+04 -13.396  < 2e-16 ***
## zipcode_98007           -1.991e+05  1.919e+04 -10.380  < 2e-16 ***
## zipcode_98008           -1.756e+05  1.539e+04 -11.412  < 2e-16 ***
## zipcode_98010           -1.655e+05  2.393e+04  -6.916 4.84e-12 ***
## zipcode_98011           -2.405e+05  1.597e+04 -15.062  < 2e-16 ***
## zipcode_98014           -1.863e+05  2.028e+04  -9.186  < 2e-16 ***
## zipcode_98019           -2.161e+05  1.793e+04 -12.049  < 2e-16 ***
## zipcode_98022           -1.347e+05  2.465e+04  -5.463 4.76e-08 ***
## zipcode_98023           -2.474e+05  1.784e+04 -13.868  < 2e-16 ***
## zipcode_98024           -1.685e+05  2.297e+04  -7.336 2.31e-13 ***
## zipcode_98027           -1.990e+05  1.371e+04 -14.515  < 2e-16 ***
## zipcode_98028           -2.439e+05  1.503e+04 -16.228  < 2e-16 ***
## zipcode_98029           -1.394e+05  1.437e+04  -9.696  < 2e-16 ***
## zipcode_98030           -2.792e+05  1.695e+04 -16.474  < 2e-16 ***
## zipcode_98031           -2.967e+05  1.578e+04 -18.806  < 2e-16 ***
## zipcode_98032           -2.897e+05  2.037e+04 -14.223  < 2e-16 ***
## zipcode_98033           -8.162e+04  1.387e+04  -5.884 4.10e-09 ***
## zipcode_98034           -2.057e+05  1.289e+04 -15.956  < 2e-16 ***
## zipcode_98038           -2.229e+05  1.581e+04 -14.098  < 2e-16 ***
## zipcode_98039            7.521e+05  2.886e+04  26.062  < 2e-16 ***
## zipcode_98040            4.251e+04  1.540e+04   2.760 0.005779 ** 
## zipcode_98042           -2.688e+05  1.531e+04 -17.556  < 2e-16 ***
## zipcode_98045           -1.005e+05  2.095e+04  -4.796 1.64e-06 ***
## zipcode_98052           -1.753e+05  1.289e+04 -13.596  < 2e-16 ***
## zipcode_98053           -1.620e+05  1.404e+04 -11.539  < 2e-16 ***
## zipcode_98055           -3.187e+05  1.516e+04 -21.019  < 2e-16 ***
## zipcode_98056           -3.090e+05  1.369e+04 -22.567  < 2e-16 ***
## zipcode_98058           -3.154e+05  1.368e+04 -23.056  < 2e-16 ***
## zipcode_98059           -2.980e+05  1.347e+04 -22.120  < 2e-16 ***
## zipcode_98065           -1.895e+05  1.668e+04 -11.362  < 2e-16 ***
## zipcode_98070           -2.733e+05  2.193e+04 -12.461  < 2e-16 ***
## zipcode_98072           -2.133e+05  1.507e+04 -14.155  < 2e-16 ***
## zipcode_98074           -2.186e+05  1.346e+04 -16.240  < 2e-16 ***
## zipcode_98075           -2.175e+05  1.406e+04 -15.463  < 2e-16 ***
## zipcode_98077           -2.308e+05  1.667e+04 -13.846  < 2e-16 ***
## zipcode_98092           -2.448e+05  1.858e+04 -13.176  < 2e-16 ***
## zipcode_98102            6.309e+04  2.083e+04   3.029 0.002459 ** 
## zipcode_98103           -6.282e+04  1.260e+04  -4.987 6.20e-07 ***
## zipcode_98105            4.097e+04  1.549e+04   2.645 0.008181 ** 
## zipcode_98106           -2.635e+05  1.412e+04 -18.659  < 2e-16 ***
## zipcode_98107           -4.169e+04  1.473e+04  -2.829 0.004673 ** 
## zipcode_98108           -2.997e+05  1.701e+04 -17.620  < 2e-16 ***
## zipcode_98109            9.609e+04  1.906e+04   5.040 4.71e-07 ***
## zipcode_98112            1.832e+05  1.523e+04  12.026  < 2e-16 ***
## zipcode_98115           -8.243e+04  1.266e+04  -6.508 7.84e-11 ***
## zipcode_98116           -9.542e+04  1.395e+04  -6.841 8.16e-12 ***
## zipcode_98117           -6.750e+04  1.276e+04  -5.289 1.25e-07 ***
## zipcode_98118           -2.702e+05  1.317e+04 -20.515  < 2e-16 ***
## zipcode_98119            9.741e+04  1.671e+04   5.829 5.70e-09 ***
## zipcode_98122           -1.102e+05  1.524e+04  -7.234 4.90e-13 ***
## zipcode_98125           -1.972e+05  1.344e+04 -14.670  < 2e-16 ***
## zipcode_98126           -1.962e+05  1.382e+04 -14.197  < 2e-16 ***
## zipcode_98133           -2.098e+05  1.306e+04 -16.072  < 2e-16 ***
## zipcode_98136           -1.279e+05  1.496e+04  -8.550  < 2e-16 ***
## zipcode_98144           -1.651e+05  1.440e+04 -11.466  < 2e-16 ***
## zipcode_98146           -2.706e+05  1.452e+04 -18.630  < 2e-16 ***
## zipcode_98148           -2.576e+05  2.540e+04 -10.140  < 2e-16 ***
## zipcode_98155           -2.198e+05  1.342e+04 -16.379  < 2e-16 ***
## zipcode_98166           -2.763e+05  1.527e+04 -18.093  < 2e-16 ***
## zipcode_98168           -3.233e+05  1.513e+04 -21.371  < 2e-16 ***
## zipcode_98177           -1.358e+05  1.483e+04  -9.156  < 2e-16 ***
## zipcode_98178           -3.578e+05  1.515e+04 -23.618  < 2e-16 ***
## zipcode_98188           -3.165e+05  1.813e+04 -17.458  < 2e-16 ***
## zipcode_98198           -3.015e+05  1.642e+04 -18.363  < 2e-16 ***
## view_1                   8.343e+04  9.913e+03   8.416  < 2e-16 ***
## view_2                   7.085e+04  6.141e+03  11.538  < 2e-16 ***
## view_3                   1.339e+05  8.393e+03  15.951  < 2e-16 ***
## view_4                   2.765e+05  1.252e+04  22.084  < 2e-16 ***
## condition_2             -1.470e+04  1.369e+04  -1.074 0.283043    
## condition_4              2.999e+04  3.151e+03   9.517  < 2e-16 ***
## condition_5              7.820e+04  5.001e+03  15.638  < 2e-16 ***
## grade_3                 -1.844e+06  1.175e+05 -15.694  < 2e-16 ***
## grade_4                 -1.950e+06  6.319e+04 -30.862  < 2e-16 ***
## grade_5                 -1.988e+06  5.377e+04 -36.972  < 2e-16 ***
## grade_6                 -1.995e+06  5.239e+04 -38.086  < 2e-16 ***
## grade_7                 -1.990e+06  5.193e+04 -38.327  < 2e-16 ***
## grade_8                 -1.966e+06  5.159e+04 -38.112  < 2e-16 ***
## grade_9                 -1.891e+06  5.123e+04 -36.904  < 2e-16 ***
## grade_10                -1.769e+06  5.104e+04 -34.668  < 2e-16 ***
## grade_11                -1.563e+06  5.108e+04 -30.599  < 2e-16 ***
## grade_12                -1.190e+06  5.321e+04 -22.366  < 2e-16 ***
## waterfront_0            -5.829e+05  1.750e+04 -33.318  < 2e-16 ***
## season_Spring            1.601e+04  5.398e+03   2.967 0.003015 ** 
## season_Summer            6.136e+03  9.202e+03   0.667 0.504908    
## season_Fall             -7.329e+03  1.352e+04  -0.542 0.587774    
## distance_to_convergence -7.774e+03  5.672e+02 -13.706  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 147600 on 15024 degrees of freedom
## Multiple R-squared:  0.8401, Adjusted R-squared:  0.839 
## F-statistic:   759 on 104 and 15024 DF,  p-value: < 2.2e-16
view_model_results(df_results)
# Evaluate OLS_Step_Both
df_results <- evaluate_model("OLS_Step_Both", model_both, train_df_linear, test_df_linear, target_var = 'price', df_results)

# Evaluate OLS_Step_Forward
df_results <- evaluate_model("OLS_Step_Forward", model_forward, train_df_linear, test_df_linear, target_var = 'price', df_results)

# Evaluate OLS_Step_Backward
df_results <- evaluate_model("OLS_Step_Backward", model_backward, train_df_linear, test_df_linear, target_var = 'price', df_results)

view_model_results(df_results)

Detailed Initial Model Insights

Model Fit

  • R-squared (0.8404): Approximately 84.04% of the variability in real estate prices is explained by the model, indicating a strong fit.
  • Adjusted R-squared (0.8393): Adjusted for the number of predictors, this value reaffirms the model’s effectiveness.

Coefficient Analysis

  • Significant Predictors: Features like sqft_living, waterfront1, and bedrooms have significant coefficients, implying a notable impact on housing prices.
  • Insignificant Predictors: Variables such as sqft_lot15 show less significance, suggesting a minor influence on price.

Grade Variable Insights

  • Grade Variables: The negative coefficients for grades (grade_3 to grade_12) compared to a baseline grade (the omitted variable) are intriguing. This suggests that higher grades (implying better quality) are associated with lower prices, which warrants further investigation for data inconsistencies or other underlying factors.

Outliers and Residuals

  • Large Residuals: The considerable spread in residuals (from -1518789 to 3645881) may indicate the presence of outliers or non-linear relationships not captured by the model.

Model Diagnostics and Assumptions

Checking Model Assumptions

As we scrutinize our OLS model’s assumptions through diagnostic plotting, we encounter significant insights that warrant our attention and action.

Residuals vs Fitted Plot

The residuals vs fitted plot is our first stop to assess the assumption of linearity. The ideal scenario is a random spread of residuals around the horizontal axis, indicating a linear relationship between predictors and the response.

Observation: A discernible pattern in the residuals suggests that our model may not be capturing some of the non-linear relationships. Points of Concern: Specific points, notably labeled 431, 9381, and 11944, stray from the general cluster, hinting at potential outliers or influential observations.

Normal Q-Q Plot

The Q-Q plot offers a visual comparison of the distribution of residuals against a perfectly normal distribution. Deviations from the diagonal indicate departures from normality.

Observation: The tails of our plot, particularly the right tail, show a clear departure from normality, with several points veering off the expected line. Specific Note: The points at the upper end, again including 431, 9381, and 11944, are especially distant from the line, suggesting they could be outliers with a substantial effect on our model’s performance.

Scale-Location Plot

Also known as the spread-location plot, it’s used to check for equal variance of residuals (homoscedasticity).

Observation: The red line’s upward trajectory indicates an increase in the spread of residuals as the fitted values rise, which is a classic sign of heteroscedasticity.

Residuals vs Leverage Plot

This plot helps us identify influential data points that could unduly sway our model’s predictions.

Observation: Several points fall outside the Cook’s distance lines, with points 3405, 9381 and 11944 appearing as particularly influential to our regression estimates.

Cook’s Distance Plot

The Cook’s distance plot is instrumental in quantifying the influence of each data point.

Observation: The spikes in the plot, notably for points 3405, 9381, and 11944, surpass the typical threshold, suggesting they are disproportionately influencing the model.

Next Steps in Model Diagnostics

Having thoroughly examined the diagnostic plots for our OLS regression model, we have identified several areas requiring intervention to ensure the robustness of our model. Here’s how we propose to address these:

Addressing Non-linearity

The Residuals vs Fitted plot signaled potential non-linearity, which could be corrected by:

  • Integrating Polynomial Terms: We should experiment with adding polynomial terms for continuous predictors to capture any non-linear relationships.
  • Exploring Interaction Effects: If certain predictors interact in a non-additive manner, including interaction terms might offer a better fit.

Stabilizing Variance

The Scale-Location plot indicated the presence of heteroscedasticity, suggesting our model’s residuals have non-constant variance. To mitigate this:

  • Transformation of Response Variable: We could apply a log transformation to the response variable, price, which often stabilizes variance in the presence of heteroscedasticity.
  • Implementing Weighted Regression: A weighted least squares approach could also be employed, giving different weights to observations based on their fitted values.

Normalizing Residuals

The Normal Q-Q plot showed deviations from normality, especially in the tails, prompting us to consider:

  • Applying Transformations: Besides log transformations, other transformations like square root or inverse could be applied to achieve normality in the residuals.
  • Leveraging Non-linear Models: Should transformations fail to yield normally distributed residuals, we might pivot to non-linear models better suited to the data structure.

Managing Influential Observations

The Residuals vs Leverage plot, coupled with the Cook’s Distance plot, highlighted observations with undue influence on our model’s estimates. We’ll approach this by:

  • Investigating High-leverage Points: Points like 3405, 9381, and 11944 require close examination to determine their legitimacy and impact on the model.
  • Considering Removal or Retention: Post investigation, we’ll make informed decisions about removing these points or alternatively, leveraging robust regression techniques to diminish their influence.

Refinement and Validation

As we implement these adjustments, we will:

  • Continually Refine Our Model: Each change brings us closer to a model that accurately captures the underlying data structure.
  • Rigorously Validate Adjustments: It’s vital to validate our model using cross-validation techniques to ensure our changes generalize well to unseen data.

Through these methodical steps, we aim to enhance the reliability and validity of our model’s predictive power, ensuring that it not only meets statistical assumptions but also aligns with empirical data.

Diagnostics for Homoscedasticity and Normality of Errors

Breusch-Pagan Test

We will perform the Breusch-Pagan test to examine if the assumption of homoscedasticity holds in our model. This test will help us understand if the variances of the residuals are constant across levels of the independent variables.

## 
##  Breusch-Pagan test
## 
## data:  linear_model_initial
## BP = 101694, df = 104, p-value < 2.2e-16

The Breusch-Pagan test results with a p-value significantly lower than the alpha level of 0.05 suggests that we reject the null hypothesis of constant error variances in favor of the alternative hypothesis, indicating the presence of heteroscedasticity in our model residuals.

To further assess the normality of our model’s residuals, we will conduct the Anderson-Darling normality test, especially relevant given our large sample size.

## 
##  Anderson-Darling normality test
## 
## data:  linear_model_initial$residuals
## A = 578.96, p-value < 2.2e-16

If the p-value from the Anderson-Darling test is less than our alpha level of 0.05, we will reject the null hypothesis that the errors are normally distributed.

Breusch-Pagan Test Results

The results from the Breusch-Pagan test are as follows:

  • Test Statistic (BP): 1.0169412^{5}
  • Degrees of Freedom (df): 104
  • P-value: 0

Given that our p-value is less than 0.05, we reject the null hypothesis, indicating that heteroscedasticity is likely present in our residuals.

Anderson-Darling Test Results

The results from the Anderson-Darling test are as follows:

  • A-squared: 578.9560396
  • P-value: 3.7^{-24}

Since the p-value is less than 0.05, we reject the null hypothesis, suggesting that errors are not normally distributed.

Outlier Detection

To further investigate nonlinearity, constant variance of errors, and outliers, we examine the data points with potentially high influence.

Residuals vs Leverage Plot

We use the ols_plot_resid_lev function to create a plot for detecting outliers and observations with high leverage.

  • Observation: Look for observations that have high leverage (extreme values on the right side of the plot) and potentially strong influence on the model.

Residuals vs Fitted Plot

We use the ols_plot_resid_stud_fit function to create a plot that helps detect non-linearity, constant variances, and outliers in residuals.

  • Observation:

Cook’s Distance

We calculate Cook’s distance for each observation to identify influential data points.

  • Observation:

Filtering Outliers

We filter the data points with Cook’s distance >= 0.04, which is a reasonable threshold for identifying influential data points.

##   235   432   867  3405  6217  9381 11944 13074 
##   235   432   867  3405  6217  9381 11944 13074
  • Conclusion:
# Remove outliers from both datasets
train_df_linear_outliers <- train_df_linear[-influential_obs, ]
train_df_non_linear_outliers <- train_df_non_linear[-influential_obs, ]
## 
## Call:
## lm(formula = price ~ ., data = train_df_linear_outliers)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1470087   -59397      384    53107  2001352 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             -1.026e+08  9.254e+06 -11.084  < 2e-16 ***
## bedrooms                -6.961e+03  1.606e+03  -4.334 1.47e-05 ***
## bathrooms                2.273e+04  2.743e+03   8.286  < 2e-16 ***
## sqft_living              1.486e+02  3.270e+00  45.454  < 2e-16 ***
## sqft_lot                 2.329e-01  4.047e-02   5.754 8.89e-09 ***
## floors                  -2.440e+04  3.328e+03  -7.332 2.38e-13 ***
## sqft_basement           -4.763e+01  3.826e+00 -12.449  < 2e-16 ***
## yr_built                -2.787e+02  6.833e+01  -4.079 4.55e-05 ***
## yr_renovated             2.500e+01  3.088e+00   8.095 6.15e-16 ***
## sqft_living15            1.810e+01  3.036e+00   5.962 2.55e-09 ***
## sqft_lot15              -6.871e-02  6.297e-02  -1.091 0.275252    
## year_sold                5.247e+04  4.591e+03  11.429  < 2e-16 ***
## day_sold                -2.644e+02  1.372e+02  -1.927 0.054060 .  
## day_of_year              1.483e+02  5.141e+01   2.885 0.003919 ** 
## zipcode_98001           -2.310e+05  1.688e+04 -13.691  < 2e-16 ***
## zipcode_98002           -2.277e+05  1.859e+04 -12.250  < 2e-16 ***
## zipcode_98003           -2.296e+05  1.733e+04 -13.250  < 2e-16 ***
## zipcode_98004            2.684e+05  1.504e+04  17.846  < 2e-16 ***
## zipcode_98005           -1.591e+05  1.668e+04  -9.540  < 2e-16 ***
## zipcode_98006           -1.765e+05  1.247e+04 -14.152  < 2e-16 ***
## zipcode_98007           -2.029e+05  1.800e+04 -11.268  < 2e-16 ***
## zipcode_98008           -1.755e+05  1.444e+04 -12.154  < 2e-16 ***
## zipcode_98010           -1.658e+05  2.245e+04  -7.384 1.62e-13 ***
## zipcode_98011           -2.420e+05  1.498e+04 -16.152  < 2e-16 ***
## zipcode_98014           -1.857e+05  1.903e+04  -9.756  < 2e-16 ***
## zipcode_98019           -2.177e+05  1.683e+04 -12.940  < 2e-16 ***
## zipcode_98022           -1.357e+05  2.313e+04  -5.868 4.50e-09 ***
## zipcode_98023           -2.485e+05  1.674e+04 -14.840  < 2e-16 ***
## zipcode_98024           -1.692e+05  2.155e+04  -7.854 4.30e-15 ***
## zipcode_98027           -1.999e+05  1.287e+04 -15.537  < 2e-16 ***
## zipcode_98028           -2.457e+05  1.411e+04 -17.416  < 2e-16 ***
## zipcode_98029           -1.429e+05  1.349e+04 -10.598  < 2e-16 ***
## zipcode_98030           -2.809e+05  1.590e+04 -17.663  < 2e-16 ***
## zipcode_98031           -2.987e+05  1.481e+04 -20.175  < 2e-16 ***
## zipcode_98032           -2.907e+05  1.911e+04 -15.208  < 2e-16 ***
## zipcode_98033           -8.864e+04  1.302e+04  -6.806 1.04e-11 ***
## zipcode_98034           -2.069e+05  1.209e+04 -17.106  < 2e-16 ***
## zipcode_98038           -2.244e+05  1.484e+04 -15.127  < 2e-16 ***
## zipcode_98039            7.076e+05  2.738e+04  25.838  < 2e-16 ***
## zipcode_98040            3.304e+04  1.447e+04   2.284 0.022373 *  
## zipcode_98042           -2.703e+05  1.437e+04 -18.813  < 2e-16 ***
## zipcode_98045           -1.006e+05  1.966e+04  -5.118 3.12e-07 ***
## zipcode_98052           -1.775e+05  1.210e+04 -14.675  < 2e-16 ***
## zipcode_98053           -1.614e+05  1.318e+04 -12.252  < 2e-16 ***
## zipcode_98055           -3.200e+05  1.423e+04 -22.492  < 2e-16 ***
## zipcode_98056           -3.097e+05  1.285e+04 -24.104  < 2e-16 ***
## zipcode_98058           -3.170e+05  1.284e+04 -24.698  < 2e-16 ***
## zipcode_98059           -2.989e+05  1.264e+04 -23.647  < 2e-16 ***
## zipcode_98065           -1.883e+05  1.565e+04 -12.034  < 2e-16 ***
## zipcode_98070           -2.590e+05  2.058e+04 -12.583  < 2e-16 ***
## zipcode_98072           -2.154e+05  1.414e+04 -15.231  < 2e-16 ***
## zipcode_98074           -2.213e+05  1.263e+04 -17.513  < 2e-16 ***
## zipcode_98075           -2.190e+05  1.320e+04 -16.593  < 2e-16 ***
## zipcode_98077           -2.324e+05  1.564e+04 -14.860  < 2e-16 ***
## zipcode_98092           -2.467e+05  1.743e+04 -14.151  < 2e-16 ***
## zipcode_98102            2.709e+04  1.972e+04   1.373 0.169713    
## zipcode_98103           -6.633e+04  1.182e+04  -5.611 2.04e-08 ***
## zipcode_98105            3.841e+04  1.453e+04   2.643 0.008234 ** 
## zipcode_98106           -2.654e+05  1.325e+04 -20.029  < 2e-16 ***
## zipcode_98107           -4.602e+04  1.383e+04  -3.329 0.000875 ***
## zipcode_98108           -3.008e+05  1.596e+04 -18.848  < 2e-16 ***
## zipcode_98109            9.279e+04  1.789e+04   5.187 2.16e-07 ***
## zipcode_98112            1.809e+05  1.429e+04  12.659  < 2e-16 ***
## zipcode_98115           -8.394e+04  1.188e+04  -7.064 1.69e-12 ***
## zipcode_98116           -9.858e+04  1.309e+04  -7.532 5.29e-14 ***
## zipcode_98117           -6.968e+04  1.198e+04  -5.818 6.07e-09 ***
## zipcode_98118           -2.708e+05  1.236e+04 -21.917  < 2e-16 ***
## zipcode_98119            9.318e+04  1.568e+04   5.942 2.87e-09 ***
## zipcode_98122           -1.149e+05  1.430e+04  -8.035 1.00e-15 ***
## zipcode_98125           -1.979e+05  1.262e+04 -15.688  < 2e-16 ***
## zipcode_98126           -1.975e+05  1.297e+04 -15.228  < 2e-16 ***
## zipcode_98133           -2.118e+05  1.225e+04 -17.290  < 2e-16 ***
## zipcode_98136           -1.298e+05  1.404e+04  -9.249  < 2e-16 ***
## zipcode_98144           -1.660e+05  1.351e+04 -12.281  < 2e-16 ***
## zipcode_98146           -2.707e+05  1.363e+04 -19.862  < 2e-16 ***
## zipcode_98148           -2.593e+05  2.384e+04 -10.876  < 2e-16 ***
## zipcode_98155           -2.196e+05  1.259e+04 -17.444  < 2e-16 ***
## zipcode_98166           -2.707e+05  1.433e+04 -18.888  < 2e-16 ***
## zipcode_98168           -3.239e+05  1.420e+04 -22.812  < 2e-16 ***
## zipcode_98177           -1.264e+05  1.395e+04  -9.058  < 2e-16 ***
## zipcode_98178           -3.563e+05  1.421e+04 -25.070  < 2e-16 ***
## zipcode_98188           -3.171e+05  1.701e+04 -18.640  < 2e-16 ***
## zipcode_98198           -2.996e+05  1.541e+04 -19.449  < 2e-16 ***
## view_1                   8.658e+04  9.301e+03   9.308  < 2e-16 ***
## view_2                   6.953e+04  5.768e+03  12.055  < 2e-16 ***
## view_3                   1.312e+05  7.883e+03  16.650  < 2e-16 ***
## view_4                   2.921e+05  1.184e+04  24.678  < 2e-16 ***
## condition_2             -1.699e+04  1.285e+04  -1.323 0.186003    
## condition_4              2.935e+04  2.958e+03   9.923  < 2e-16 ***
## condition_5              8.004e+04  4.692e+03  17.059  < 2e-16 ***
## grade_3                 -1.487e+06  1.213e+05 -12.258  < 2e-16 ***
## grade_4                 -1.595e+06  7.799e+04 -20.453  < 2e-16 ***
## grade_5                 -1.631e+06  7.154e+04 -22.802  < 2e-16 ***
## grade_6                 -1.636e+06  7.066e+04 -23.150  < 2e-16 ***
## grade_7                 -1.627e+06  7.038e+04 -23.112  < 2e-16 ***
## grade_8                 -1.598e+06  7.019e+04 -22.763  < 2e-16 ***
## grade_9                 -1.516e+06  7.001e+04 -21.655  < 2e-16 ***
## grade_10                -1.387e+06  6.993e+04 -19.840  < 2e-16 ***
## grade_11                -1.181e+06  7.005e+04 -16.862  < 2e-16 ***
## grade_12                -8.417e+05  7.165e+04 -11.749  < 2e-16 ***
## waterfront_0            -5.090e+05  1.662e+04 -30.619  < 2e-16 ***
## season_Spring            1.475e+04  5.066e+03   2.912 0.003599 ** 
## season_Summer            5.712e+03  8.637e+03   0.661 0.508370    
## season_Fall             -8.989e+03  1.269e+04  -0.708 0.478786    
## distance_to_convergence -7.798e+03  5.322e+02 -14.652  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 138500 on 15016 degrees of freedom
## Multiple R-squared:  0.8449, Adjusted R-squared:  0.8438 
## F-statistic: 786.5 on 104 and 15016 DF,  p-value: < 2.2e-16

Neural Network Model

# Set to TRUE to run the actual NN
update_model_parameters <- FALSE

# Define a normalization function
normalize_data <- function(x) {
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}

# Function to prepare the dataset
prepare_dataset <- function(df) {
  df$waterfront <- as.numeric(df$waterfront_1)
  selected_columns <- setdiff(names(df), c("season", "price_cat", "waterfront_0", "waterfront_1"))
  df_normalized <- as.data.frame(lapply(df[selected_columns], normalize_data))
  return(df_normalized)
}

# Function to rescale normalized data back to original scale
rescale_to_original <- function(predictions, mean, sd) {
  return (predictions * sd + mean)
}

# Define model builder
build_model <- function(hp) {
  model <- keras_model_sequential() %>%
    layer_dense(units = hp$Int('units_1', min_value = 32, max_value = 128, step = 32),
                activation = 'relu', input_shape = ncol(train_data) - 1) %>%
    layer_dense(units = hp$Int('units_2', min_value = 16, max_value = 64, step = 16),
                activation = 'relu') %>%
    # Add more layers if needed
    layer_dense(units = 1)

  model %>% compile(
    optimizer = optimizer_adam(hp$Float('learning_rate', 1e-4, 1e-2, sampling = 'log')),
    loss = 'mean_squared_error',
    metrics = c('mean_absolute_error')
  )
  return(model)
}

train_and_evaluate_keras_nn <- function(train_data, test_data, original_train_price, original_test_price) {
  # Initialize variables to store the best model and its lowest validation MAE
  best_model <- NULL
  lowest_val_mae <- Inf
  best_batch_size <- NULL  # Store the best batch size
  best_epochs <- NULL  # Store the best number of epochs
  batch_sizes <- c(32, 64, 128, 256, 512)  # Different batch sizes to experiment with
  epochs_list <- c(50, 100, 150, 200, 250)  # Different numbers of epochs to experiment with

  for (batch_size in batch_sizes) {
    for (epochs in epochs_list) {
      # Define a complex Keras model with multiple layers
      model <- keras_model_sequential() %>%
        layer_dense(units = 256, activation = 'relu', input_shape = ncol(train_data) - 1) %>%
        layer_dropout(rate = 0.3) %>%
        layer_dense(units = 128, activation = 'relu') %>%
        layer_dropout(rate = 0.3) %>%
        layer_dense(units = 64, activation = 'relu') %>%
        layer_dense(units = 32, activation = 'relu') %>%
        layer_dense(units = 16, activation = 'relu') %>%
        layer_dense(units = 8, activation = 'relu') %>%
        layer_dense(units = 4, activation = 'relu') %>%
        layer_dense(units = 2, activation = 'relu') %>%
        layer_dense(units = 1)

      # Compile the model
      model %>% compile(
        loss = 'mean_squared_error',
        optimizer = 'adam',
        metrics = c('mean_absolute_error')
      )

      # Fit the model to training data
      history <- model %>% fit(
        x = as.matrix(train_data[, -which(names(train_data) == "price")]),
        y = train_data$price,
        epochs = epochs,
        batch_size = batch_size,
        validation_split = 0.2  # Use 20% of training data for validation
      )

      # Check the performance on the validation set
      val_mae <- min(history$metrics$val_mean_absolute_error)
      if (val_mae < lowest_val_mae) {
        lowest_val_mae <- val_mae
        best_model <- model
        best_batch_size <- batch_size
        best_epochs <- epochs
      }
    }
  }

  # Print the best hyperparameters and their correlation
  cat("Best Batch Size:", best_batch_size, "\n")
  cat("Best Number of Epochs:", best_epochs, "\n")

  # Make predictions on the training and test data
  train_predictions <- predict(best_model, x = as.matrix(train_data[, -which(names(train_data) == "price")]))
  test_predictions <- predict(best_model, x = as.matrix(test_data[, -which(names(test_data) == "price")]))

  # Rescale predictions back to original scale
  rescaled_train_predictions <- rescale_to_original(train_predictions, mean(original_train_price), sd(original_train_price))
  rescaled_test_predictions <- rescale_to_original(test_predictions, mean(original_test_price), sd(original_test_price))

  # Calculate various performance metrics
  metrics <- list(
    mae_train = mean(abs(rescaled_train_predictions - original_train_price)),
    mae_test = mean(abs(rescaled_test_predictions - original_test_price)),
    RMSE_train = sqrt(mean((rescaled_train_predictions - original_train_price)^2)),
    RMSE_test = sqrt(mean((rescaled_test_predictions - original_test_price)^2)),
    SSE_train = sum((rescaled_train_predictions - original_train_price)^2),
    SSE_test = sum((rescaled_test_predictions - original_test_price)^2),
    correlation_train = cor(rescaled_train_predictions, original_train_price),
    correlation_test = cor(rescaled_test_predictions, original_test_price),
    r_squared_train = cor(rescaled_train_predictions, original_train_price)^2,
    r_squared_test = cor(rescaled_test_predictions, original_test_price)^2
  )

  print(paste("Model Correlation:", metrics$correlation_test))

  return(metrics)
}

# Load and prepare data
train_df_logis <- train_df_non_linear_outliers
test_df_logis <- test_df_non_linear

train_norm <- prepare_dataset(train_df_logis)
test_norm <- prepare_dataset(test_df_logis)

# Original non-normalized price data
original_train_price <- train_df_non_linear_outliers$price
original_test_price <- test_df_non_linear$price

# Train and evaluate the model
if (update_model_parameters) {
  library(keras)
  library(reticulate)
  # Import python as the backend for using Keras
  use_python(python = "C:\\Users\\Charl\\AppData\\Local\\Programs\\Python\\Python39\\python.exe", required = TRUE)
  # Fetch metrics
  nn_metrics <- train_and_evaluate_keras_nn(train_norm, test_norm, original_train_price, original_test_price)
  # Code to update model parameters in JSON
  update_model_json("NeuralNetwork", nn_metrics, json_filepath)
} else {
  # Code to load model parameters from JSON
  model_params <- fromJSON(json_filepath)
  nn_metrics <- model_params$NeuralNetwork
}

# Prepare results for the dataframe
nn_model_row <- data.frame(
  Model = "Neural Network",
  SSE_train = nn_metrics$SSE_train,
  SSE_test = nn_metrics$SSE_test,
  R_squared_train = nn_metrics$r_squared_train,
  R_squared_test = nn_metrics$r_squared_test,
  RMSE_train = nn_metrics$RMSE_train,
  RMSE_test = nn_metrics$RMSE_test,
  MAE_train = nn_metrics$mae_train,
  MAE_test = nn_metrics$mae_test,
  stringsAsFactors = FALSE
)

# Append the new row to df_results
df_results <- rbind(df_results, nn_model_row)
view_model_results(df_results)

Regression Tree

# Create data frames for regression tree analysis
# Remove 'price_cat' and keep 'price' in the datasets
train_df_reg <- train_df_logis[, !names(train_df_logis) %in% "price_cat"]
test_df_reg <- test_df_logis[, !names(test_df_logis) %in% "price_cat"]

# Define a broader range of hyperparameters to search
tuneGrid <- expand.grid(
  cp = seq(0.001, 0.1, by = 0.001)
)

# Create a custom tuning grid for rpart
customControl <- trainControl(
  method = "cv",
  number = 5,
  search = "grid",
  verboseIter = FALSE
)

# Perform hyperparameter tuning with custom control
model <- train(
  price ~ .,
  data = train_df_reg,
  method = "rpart",
  trControl = customControl,
  tuneGrid = tuneGrid
)

# Make predictions
p.rpart_train <- predict(model, newdata = train_df_reg)
p.rpart_test <- predict(model, newdata = test_df_reg)

# Calculate metrics for training dataset
MAE_train <- mae(train_df_reg$price, p.rpart_train)
SSE_train <- sum((train_df_reg$price - p.rpart_train)^2)
R_squared_train <- R2(train_df_reg$price, p.rpart_train)
RMSE_train <- rmse(train_df_reg$price, p.rpart_train)

# Calculate metrics for testing dataset
MAE_test <- mae(test_df_reg$price, p.rpart_test)
SSE_test <- sum((test_df_reg$price - p.rpart_test)^2)
R_squared_test <- R2(test_df_reg$price, p.rpart_test)
RMSE_test <- rmse(test_df_reg$price, p.rpart_test)

# Append the results to df_results
df_results <- rbind(df_results, data.frame(
  Model = "Regression Tree with Tuning",
  SSE_train = SSE_train,
  SSE_test = SSE_test,
  R_squared_train = R_squared_train,
  R_squared_test = R_squared_test,
  RMSE_train = RMSE_train,
  RMSE_test = RMSE_test,
  MAE_train = MAE_train,
  MAE_test = MAE_test,
  stringsAsFactors = FALSE
))

# View Model Results
view_model_results(df_results)

Logistic Regression

# update_model_parameters <- TRUE
#
# if (update_model_parameters) {
#     # Calculate mean price for each zipcode
#     calc.mean.price <- function(col, df) {
#       df.select <- df[df[,col] == 1,]
#       mean.price <- mean(df.select$price)
#       return(mean.price)
#     }
#
#     # Adding mean price for each zipcode to the datasets
#     indx.zipcode <- grepl('zipcode', colnames(train_df_logis))
#     colname.zipcode <- colnames(train_df_logis)[indx.zipcode]
#     train_df_logis$zipcode.meanprice <- 0.0
#     test_df_logis$zipcode.meanprice <- 0.0
#
#     for (z in colname.zipcode) {
#       mean_price_train <- calc.mean.price(z, train_df_logis)
#       mean_price_test <- calc.mean.price(z, test_df_logis)
#
#       train_rows_to_update <- train_df_logis[, z] == 1
#       test_rows_to_update <- test_df_logis[, z] == 1
#
#       if (any(train_rows_to_update)) {
#         train_df_logis$zipcode.meanprice[train_rows_to_update] <- mean_price_train
#       } else {
#         warning(paste("No rows to update for train data in column:", z))
#       }
#
#       if (any(test_rows_to_update)) {
#         test_df_logis$zipcode.meanprice[test_rows_to_update] <- mean_price_test
#       } else {
#         warning(paste("No rows to update for test data in column:", z))
#       }
#       # Check the number of rows in train_df_logis
#       if (nrow(train_df_logis) == 0) {
#         stop("train_df_logis became empty after adding mean prices.")
#       }
#     }
#
#     # Removing unnecessary columns
#     train_df_logis <- train_df_logis[, !indx.zipcode]
#     train_df_logis <- train_df_logis[ , -which(names(train_df_logis) %in% c("lat", "long"))]
#
#     # Check the number of rows in train_df_logis
#     if (nrow(train_df_logis) == 0) {
#       stop("train_df_logis became empty after removing unnecessary columns.")
#     }
#
#     test_df_logis <- test_df_logis[, !indx.zipcode]
#     test_df_logis <- test_df_logis[ , -which(names(test_df_logis) %in% c("lat", "long"))]
#
#     outliers <- c("13244", "3098", "9986", "5127", "2627", "8814", "643")
#     # Handling Outliers
#     if (any(rownames(train_df_logis) %in% outliers)) {
#         train_df_logis <- train_df_logis[!rownames(train_df_logis) %in% outliers,]
#     } else {
#         warning("No outliers found or row names do not match in train_df_logis.")
#     }
#
#     # Check the number of rows in train_df_logis
#     if (nrow(train_df_logis) == 0) {
#       stop("train_df_logis became empty after handling outliers.")
#     }
#
#     # Creating a categorical response variable based on median price
#     price_med <- median(train_df_logis$price)
#     if (nrow(train_df_logis) > 0) {
#       train_df_logis$price_cat <- as.factor(ifelse(train_df_logis$price < price_med, 0, 1))
#     } else {
#       stop("train_df_logis is empty. Cannot create price_cat.")
#     }
#
#
#     if (nrow(test_df_logis) > 0) {
#       test_df_logis$price_cat <- as.factor(ifelse(test_df_logis$price < price_med.test, 0, 1))
#     } else {
#       stop("test_df_logis is empty. Cannot create price_cat.")
#     }
#
#     # Fit a logistic regression model
#     lmod <- glm(price_cat ~ . -price, family = binomial, data = train_df_logis)
#
#     # Perform stepwise model selection to drop non-significant variables
#     lmodr <- step(lmod, trace = 0)
#
#     # Use the 'drop1' function to assess variable importance and drop non-significant variables
#     # Continue this process until all variables are significant
#     while(any(drop1(lmodr, test = "Chi")$`Pr(>Chi)` > 0.05)) {
#       variable_to_drop <- names(which.max(drop1(lmodr, test = "Chi")$`Pr(>Chi)`))
#       lmodr <- update(lmodr, as.formula(paste(".~.-", variable_to_drop)))
#     }
#
#     # Use the final model
#     lmodr.final <- lmodr
#
#     # Predict probabilities and classify based on the threshold
#     predicted_probabilities <- predict(lmodr.final, train_df_logis, type = "response")
#     predictions <- ifelse(predicted_probabilities > 0.45, 1, 0)
#
#     # Create a confusion matrix for training data
#     confusion_matrix <- confusionMatrix(as.factor(predictions), as.factor(train_df_logis$price_cat), mode = "prec_recall", positive = "1")
#
#     # Extract relevant metrics from the confusion matrix
#     sensitivity <- confusion_matrix$byClass['Sensitivity']
#     specificity <- confusion_matrix$byClass['Specificity']
#     precision <- confusion_matrix$byClass['Precision']
#     F1 <- confusion_matrix$byClass['F1']
#
#     # Construct the list of model metrics for logistic regression
#     logistic_regression_metrics <- list(
#         sensitivity = sensitivity,
#         specificity = specificity,
#         precision = precision,
#         F1 = F1,
#         model_coefficients = summary(lmodr.final)$coefficients,
#         AIC = AIC(lmodr.final),
#         BIC = BIC(lmodr.final)
#     )
#
#     # Update the model parameters in the JSON file
#     update_model_json("LogisticRegression", logistic_regression_metrics, json_filepath)
#
# } else {
#     # Load model parameters from JSON
#     model_params <- fromJSON(json_filepath)
#     logistic_regression_metrics <- model_params$LogisticRegression
# }
#
#
# # Create a new dataframe for logistic regression results
# df_results_categorical <- data.frame(
#     Model = "Logistic Regression",
#     Sensitivity = logistic_regression_metrics$sensitivity,
#     Specificity = logistic_regression_metrics$specificity,
#     Precision = logistic_regression_metrics$precision,
#     F1 = logistic_regression_metrics$F1,
#     AIC = logistic_regression_metrics$AIC,
#     BIC = logistic_regression_metrics$BIC,
#     stringsAsFactors = FALSE
# )
#
# # Append coefficients to df_results_categorical if necessary
# coefficients_df <- as.data.frame(logistic_regression_metrics$model_coefficients)
# df_results_categorical <- cbind(df_results_categorical, coefficients_df)
#
# # Print the results dataframe
# view_model_results(df_results_categorical)

Decision Tree

# # Create data frames for decision tree analysis
# train_df_dec <- train_df_logis
# test_df_dec <- test_df_logis
#
# # Create a categorical response variable: 1=higher than 200,000, 0=lower than 200,000
# train_df_dec$price_low <- 1
# train_df_dec[train_df_dec$price > 200000,]$price_low <- 0
# train_df_dec$price_low <- as.factor(train_df_dec$price_low)
# table(train_df_dec$price_low)
#
# test_df_dec$price_low <- 1
# test_df_dec[test_df_dec$price > 200000,]$price_low <- 0
# test_df_dec$price_low <- as.factor(test_df_dec$price_low)
# table(test_df_dec$price_low)
#
# # Remove 'price' and 'price_cat' columns
# train_df_dec <- train_df_dec[, -which(names(train_df_dec) %in% c("price", "price_cat"))]
# test_df_dec <- test_df_dec[, -which(names(test_df_dec) %in% c("price", "price_cat"))]
#
# # Load the 'C50' library for decision tree modeling
# library(C50)
#
# # Fit a decision tree model
# price_model <- C5.0(train_df_dec, train_df_dec$price_low)
# price_model
#
# # Evaluate the model's performance
# summary(price_model)
#
# # Visualize the decision tree
# plot(price_model)
#
# # Make predictions on the test data
# price_pred <- predict(price_model, test_df_dec)
#
# # Evaluate the model's performance using a confusion matrix
# library(gmodels)
# CrossTable(test_df_dec$price_low, price_pred,
#            prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,
#            dnn = c('actual default', 'predicted default'))
#
# # Boosting
# price_boost10 <- C5.0(train_df_dec, train_df_dec$price_low, trials = 10)
# summary(price_boost10)